
Experiment tracking #86

Merged
AymenFJA merged 7 commits into main from feature/experiment_tracking
Mar 12, 2026

Conversation

@AymenFJA
Collaborator

@AymenFJA AymenFJA commented Mar 10, 2026

This PR introduces experiment tracking capabilities:

  1. Core: vanilla file-based tracker.
  2. Integration: ClearML.
  3. Integration: MLflow.

This PR only adds ROSE tracking capabilities at the learner level, specifically in the outer loop of the learning approach.

MLflow

[screenshots: MLflow tracking UI]

ClearML

[screenshot: ClearML tracking UI]

AymenFJA added 3 commits March 9, 2026 21:34
1-Base tracking capabilities
2-Plug-and-play (PNP) approach for trackers
3-Native file-based exp. tracker
4-MLflow exp. tracker
5-ClearML exp. tracker
6-Extend examples
7-Extend docs
1-Test suite for tracking
2-Examples for tracking
3-Update docs
1-Modify the approach so it aligns with ROSE working on the OUTER loop of the learning, not the INNER loop
2-Update Docs
3-Update Examples
4-Update Tests
@AymenFJA AymenFJA self-assigned this Mar 10, 2026
@AymenFJA AymenFJA added raas ROSE As A Service Q1 labels Mar 10, 2026
@AymenFJA AymenFJA changed the title Feature/experiment tracking Experiment tracking Mar 10, 2026
@gemini-code-assist

Warning

Gemini is experiencing higher than usual traffic and was unable to create the summary. Please try again in a few hours by commenting /gemini summary.

@AymenFJA
Collaborator Author

/gemini summary

@gemini-code-assist

Summary of Changes

This pull request introduces a robust and flexible experiment tracking system to ROSE, significantly enhancing its observability capabilities. By implementing a pluggable TrackerBase protocol, it cleanly separates experiment logging concerns from the core learning algorithms. This allows users to effortlessly integrate popular platforms like ClearML and MLflow, or utilize a simple file-based logger, with minimal code changes. The system automatically captures pipeline configurations, iteration metrics, and stop reasons, promoting reproducibility and providing richer insights into learning workflows, particularly beneficial in HPC environments prone to job preemption.

Highlights

  • Pluggable Experiment Tracking System: Introduced a new TrackerBase protocol and PipelineManifest to enable a flexible and extensible experiment tracking system, allowing automatic logging of pipeline events without modifying the core learning loop.
  • ClearML Integration: Added a native integration for ClearML via ClearMLTracker, providing seamless experiment tracking with ClearML, including support for parallel learner comparison.
  • MLflow Integration: Implemented a native integration for MLflow via MLflowTracker, enabling automatic logging of parameters, metrics, and tags, and deprecating the previous manual wiring approach.
  • HPC-Safe File-Based Tracking: Included a basic HPC_FileTracker for robust, append-only, and preemption-safe logging of experiment data to JSON Lines files, suitable for High-Performance Computing environments.
  • Documentation and Examples: Updated the documentation with a comprehensive user guide for the new tracking system and detailed guides for ClearML and MLflow integrations, alongside new example scripts demonstrating the add_tracker() approach.
  • Core Learner Refactoring: Refactored SequentialActiveLearner, ParallelUQLearner, and SequentialRLLearner classes to incorporate tracker notification callbacks at key lifecycle points, ensuring consistent and automatic event dispatching.
  • Dependency Management: Added optional dependencies for mlflow and clearml in pyproject.toml, allowing users to install only the required tracking integrations.
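The highlights above describe a protocol-based design in which learners dispatch events to registered trackers. As a rough illustration of how such a pluggable interface can work, here is a minimal, self-contained sketch using `typing.Protocol`. The method names `on_iteration`/`on_stop` are assumptions loosely based on the review comments in this PR (`on_stop(final_state, reason)` appears there), and `MiniLearner` is a toy stand-in, not ROSE's actual learner code.

```python
from __future__ import annotations

from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class TrackerBase(Protocol):
    """Illustrative tracker protocol: any object with these methods qualifies."""

    def on_iteration(self, state: dict[str, Any]) -> None: ...
    def on_stop(self, final_state: dict[str, Any] | None, reason: str) -> None: ...


class ListTracker:
    """Toy tracker that records every event in memory."""

    def __init__(self) -> None:
        self.events: list[tuple[str, Any]] = []

    def on_iteration(self, state: dict[str, Any]) -> None:
        self.events.append(("iteration", state))

    def on_stop(self, final_state: dict[str, Any] | None, reason: str) -> None:
        self.events.append(("stop", reason))


class MiniLearner:
    """Learner-style dispatcher: notifies all registered trackers."""

    def __init__(self) -> None:
        self._trackers: list[TrackerBase] = []

    def add_tracker(self, tracker: TrackerBase) -> None:
        self._trackers.append(tracker)

    def run(self, max_iter: int) -> None:
        state: dict[str, Any] = {}
        for i in range(max_iter):
            state = {"iteration": i, "metric": 1.0 / (i + 1)}
            for t in self._trackers:
                t.on_iteration(state)
        for t in self._trackers:
            t.on_stop(state, "max_iter_reached")
```

Because the protocol is structural, a new backend only needs to implement the two callbacks; the learning loop itself never changes.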


Changelog
  • .gitignore
    • Added *.db to ignore database files.
  • docs/index.md
    • Updated the main documentation to highlight preemption-safe HPC operation and clean separation of control and observability.
  • docs/integrations/clearml.md
    • Added new documentation detailing the ClearML integration, including quick start, automatic logging, parallel learner comparison, and extension points.
  • docs/integrations/mlflow.md
    • Rewrote the MLflow integration documentation to introduce the MLflowTracker and deprecate the manual wiring approach.
  • docs/user-guide/tracking.md
    • Added a new user guide explaining the pluggable tracking system, TrackerBase protocol, PipelineManifest, and built-in trackers.
  • examples/integrations/mlflow/README.md
    • Removed the old MLflow integration README.
  • examples/integrations/mlflow/mlflow_rose.py
    • Removed the old manual MLflow integration example script.
  • examples/integrations/tracking/README.md
    • Added a new README to introduce the tracking examples.
  • examples/integrations/tracking/basic.py
    • Added a new example demonstrating the HPC_FileTracker for basic file-based logging.
  • examples/integrations/tracking/clearml/run_me.py
    • Added a new example showcasing the ClearML tracker with parallel ensemble UQ.
  • examples/integrations/tracking/mlflow/run_me_tracker.py
    • Added a new example demonstrating the MLflow tracker with an adaptive kernel schedule.
  • mkdocs.yml
    • Updated the documentation navigation to include the new tracking user guide and ClearML integration.
  • pyproject.toml
    • Added optional dependencies for mlflow, clearml, and a combined tracking extra.
  • rose/__init__.py
    • Exported PipelineManifest and TrackerBase for public API access.
  • rose/al/active_learner.py
    • Modified the start method to notify registered trackers at iteration completion and loop exit.
  • rose/integrations/clearml_tracker.py
    • Added a new ClearMLTracker class implementing the TrackerBase protocol for ClearML integration.
  • rose/integrations/mlflow_tracker.py
    • Added a new MLflowTracker class implementing the TrackerBase protocol for MLflow integration.
  • rose/learner.py
    • Implemented the add_tracker method, _build_pipeline_manifest for pipeline introspection, and _notify_trackers_iteration/_notify_trackers_stop for dispatching events.
  • rose/rl/reinforcement_learner.py
    • Modified the start method to notify registered trackers at iteration completion and loop exit.
  • rose/tracking.py
    • Added a new module defining the TrackerBase protocol, TaskManifest, CriterionManifest, and PipelineManifest dataclasses.
  • tests/integration/tracking/test_clearml_tracker.py
    • Added new integration tests to verify the functionality of ClearMLTracker.
  • tests/integration/tracking/test_mlflow_tracker.py
    • Added new integration tests to verify the functionality of MLflowTracker.
  • tests/unit/tracking/test_tracker_core.py
    • Added new unit tests for the TrackerBase protocol and manifest data structures.
  • tests/unit/tracking/test_tracker_interface.py
    • Added new unit tests to verify the add_tracker and notification mechanisms within the learner.
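The changelog mentions an `HPC_FileTracker` that writes append-only JSON Lines for preemption safety. A minimal sketch of that idea follows; the class internals are assumptions, not the actual ROSE implementation. The key property is that each event is one self-contained JSON line written in append mode, so a preempted job loses at most the last partially written line.

```python
import json
from pathlib import Path


class FileTracker:
    """Illustrative append-only JSON Lines event logger."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def _append(self, record: dict) -> None:
        # Open in append mode per event so no buffered state is lost on preemption.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def on_iteration(self, state: dict) -> None:
        self._append({"event": "iteration", **state})

    def on_stop(self, final_state, reason: str) -> None:
        self._append({"event": "stop", "reason": reason})
```

Recovery after preemption is then just re-reading the file line by line and skipping any trailing line that fails to parse.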

@AymenFJA
Collaborator Author

@gemini-code-assist review


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a well-designed and comprehensive experiment tracking system to ROSE, featuring a new TrackerBase protocol and integrations for file-based logging, MLflow, and ClearML. While the design, documentation, and test coverage are strong, a critical security vulnerability has been identified: a high-severity insecure deserialization issue in example scripts due to pickle usage on predictable file paths in /tmp. Additionally, there's a medium-severity concern regarding potential leakage of sensitive information from task decorators to tracking services. Addressing these security concerns is paramount, and further areas for improvement are detailed in specific comments.


def load_state() -> dict:
    with open(DATA_FILE, "rb") as f:
        return pickle.load(f)


security-high

The use of pickle.load() on a file located in a world-writable directory (/tmp) with a predictable filename is insecure. A local attacker could place a malicious pickle file at this location to achieve arbitrary code execution when this example script is run. This is particularly risky in shared environments like HPC clusters. Consider using a safer serialization format like JSON or ensuring the file is created with secure permissions and a non-predictable name.
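The remediation the reviewer recommends, swapping pickle for JSON, can look like the sketch below. The function signatures and path handling are illustrative, not the example script's actual code; the point is that `json.loads` never executes code, unlike `pickle.load` on attacker-controlled bytes.

```python
import json
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Persist state as JSON; only plain data, no executable payloads."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state), encoding="utf-8")


def load_state(path: Path) -> dict:
    # Parsing JSON cannot trigger code execution, so a tampered file
    # fails loudly with a parse error instead of running attacker code.
    return json.loads(path.read_text(encoding="utf-8"))
```

Placing the file under a per-user directory (e.g. below the user's home) rather than a predictable name in /tmp also removes the world-writable-directory attack surface.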

def load_state(name: str) -> dict:
    path = Path(str(DATA_FILE).format(name=name))
    with open(path, "rb") as f:
        return pickle.load(f)


security-high

The use of pickle.load() on a file located in a world-writable directory (/tmp) with a predictable filename is insecure. A local attacker could place a malicious pickle file at this location to achieve arbitrary code execution when this example script is run. Consider using a safer serialization format like JSON or ensuring the file is created with secure permissions and a non-predictable name.


def load_state() -> dict:
    with open(DATA_FILE, "rb") as f:
        return pickle.load(f)


security-high

The use of pickle.load() on a file located in a world-writable directory (/tmp) with a predictable filename is insecure. A local attacker could place a malicious pickle file at this location to achieve arbitrary code execution when this example script is run. Consider using a safer serialization format like JSON or ensuring the file is created with secure permissions and a non-predictable name.

Comment on lines +68 to +69
for k, v in task_manifest.decor_kwargs.items():
    params[f"task/{task_key}/{k}"] = v


security-medium

The tracker logs all keyword arguments from task decorators (decor_kwargs) to ClearML. If a user passes sensitive information (e.g., API keys, tokens) to a decorator, it will be logged and stored in the tracking service. It is recommended to filter out common sensitive keys before logging.

Suggested change

 for k, v in task_manifest.decor_kwargs.items():
-    params[f"task/{task_key}/{k}"] = v
+    if k.lower() not in {"api_key", "password", "secret", "token"}:
+        params[f"task/{task_key}/{k}"] = v

Comment on lines +64 to +65
for k, v in task_manifest.decor_kwargs.items():
    params[f"task.{task_key}.{k}"] = v


security-medium

The tracker logs all keyword arguments from task decorators (decor_kwargs) to MLflow. If a user passes sensitive information (e.g., API keys, tokens) to a decorator, it will be logged and stored in the tracking service. It is recommended to filter out common sensitive keys before logging.

Suggested change

 for k, v in task_manifest.decor_kwargs.items():
-    params[f"task.{task_key}.{k}"] = v
+    if k.lower() not in {"api_key", "password", "secret", "token"}:
+        params[f"task.{task_key}.{k}"] = v

Comment on lines +127 to +132
def on_stop(self, final_state, reason: str) -> None:
    super().on_stop(final_state, reason)
    if final_state and reason in ("criterion_met", "max_iter_reached"):
        model = load_model(final_state.get("checkpoint_path"))
        mlflow.sklearn.log_model(model, artifact_path="surrogate_model")


medium

The example for extending MLflowTracker has a small bug. It calls super().on_stop() before logging the model artifact. The base on_stop method calls mlflow.end_run(), which terminates the MLflow run. Any subsequent calls to log artifacts will either fail or start a new, separate run.

To ensure all logging happens within the same active run, the super().on_stop() call should be moved to the end of the method, after the artifact has been logged. This pattern is correctly used in the ClearMLTracker extension example.

Suggested change

 class MLflowArtifactTracker(MLflowTracker):
     def on_stop(self, final_state, reason: str) -> None:
-        super().on_stop(final_state, reason)
         if final_state and reason in ("criterion_met", "max_iter_reached"):
             model = load_model(final_state.get("checkpoint_path"))
             mlflow.sklearn.log_model(model, artifact_path="surrogate_model")
+        super().on_stop(final_state, reason)

Comment on lines +513 to +519
_stop_reason = "max_iter_reached"
try:
    async for state in _stream_parallel([make_run_fn(i) for i in range(parallel_learners)]):
        self._notify_trackers_iteration(state)
        yield state
finally:
    self._notify_trackers_stop(self._iteration_state, _stop_reason)


medium

The _stop_reason for the ParallelActiveLearner is initialized to "max_iter_reached" and is never updated. If the user breaks out of the async for loop, the finally block will execute and _notify_trackers_stop will be called with the incorrect reason. It should report "stopped" in this case.

A better pattern would be to default the reason to "stopped" and only update it to "max_iter_reached" if the loop completes fully.

Suggested change

-_stop_reason = "max_iter_reached"
+_stop_reason = "stopped"
 try:
     async for state in _stream_parallel([make_run_fn(i) for i in range(parallel_learners)]):
         self._notify_trackers_iteration(state)
         yield state
+    _stop_reason = "max_iter_reached"
 finally:
     self._notify_trackers_stop(self._iteration_state, _stop_reason)

Comment on lines +937 to +943
_stop_reason = "max_iter_reached"
try:
    async for state in _stream_parallel([make_run_fn(i) for i in range(parallel_learners)]):
        self._notify_trackers_iteration(state)
        yield state
finally:
    self._notify_trackers_stop(self._iteration_state, _stop_reason)


medium

The _stop_reason for the ParallelReinforcementLearner is initialized to "max_iter_reached" and is never updated. If the user breaks out of the async for loop, the finally block will execute and _notify_trackers_stop will be called with the incorrect reason. It should report "stopped" in this case.

A better pattern would be to default the reason to "stopped" and only update it to "max_iter_reached" if the loop completes fully.

Suggested change

-_stop_reason = "max_iter_reached"
+_stop_reason = "stopped"
 try:
     async for state in _stream_parallel([make_run_fn(i) for i in range(parallel_learners)]):
         self._notify_trackers_iteration(state)
         yield state
+    _stop_reason = "max_iter_reached"
 finally:
     self._notify_trackers_stop(self._iteration_state, _stop_reason)

Comment on lines +573 to +579
_stop_reason = "max_iter_reached"
try:
    async for state in _stream_parallel([make_run_fn(name) for name in learner_names]):
        self._notify_trackers_iteration(state)
        yield state
finally:
    self._notify_trackers_stop(self._iteration_state, _stop_reason)


medium

The _stop_reason for the ParallelUQLearner is initialized to "max_iter_reached" and is never updated. If the user breaks out of the async for loop, the finally block will execute and _notify_trackers_stop will be called with the incorrect reason. It should report "stopped" in this case.

A better pattern would be to default the reason to "stopped" and only update it to "max_iter_reached" if the loop completes fully.

Suggested change

-_stop_reason = "max_iter_reached"
+_stop_reason = "stopped"
 try:
     async for state in _stream_parallel([make_run_fn(name) for name in learner_names]):
         self._notify_trackers_iteration(state)
         yield state
+    _stop_reason = "max_iter_reached"
 finally:
     self._notify_trackers_stop(self._iteration_state, _stop_reason)

1-Isolate the tracked parameters from the decorator kwargs and use log_params
2-Fix examples to be more realistic
3-Update tests
4-Test with the MLflow and ClearML UIs
1-Fix 1 — ClearML series names: ClearMLTracker adds an optional learner_names parameter, improves series naming, and fixes a bug where learner_id=0 was incorrectly mapped to value.
2-Fix 2 — Non-numeric config logging: Add string config logging for MLflow (via tags) and ClearML (via connected hyperparameters) and move _TASK_NAMES to module level to avoid recreating it each iteration.
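Fix 2 above routes numeric config to metrics/params and string config to tags (MLflow) or connected hyperparameters (ClearML). The split itself is backend-independent; a sketch of that partitioning logic follows. The helper name `split_config` is hypothetical, and it is written in pure Python so it runs without MLflow or ClearML installed.

```python
from numbers import Number


def split_config(config: dict) -> tuple[dict, dict]:
    """Partition a flat config dict: numeric values become params,
    everything else is stringified into tags."""
    params, tags = {}, {}
    for key, value in config.items():
        if isinstance(value, bool):
            # bool subclasses int, so check it first and treat flags as tags.
            tags[key] = str(value)
        elif isinstance(value, Number):
            params[key] = value
        else:
            tags[key] = str(value)
    return params, tags


# With MLflow, the two halves would feed mlflow.log_params(params) and
# mlflow.set_tags(tags); with ClearML, the string side goes through the
# task's connected hyperparameters.
```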
@AymenFJA AymenFJA merged commit 5db6c20 into main Mar 12, 2026
10 checks passed
@AymenFJA AymenFJA deleted the feature/experiment_tracking branch March 12, 2026 13:48